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Abstract 

Background: For many predictive applications a large number of models is generated and later clustered in 
subsets based on structure similarity. In most clustering algorithms an all-vs-all root mean square deviation (RMSD) 
comparison is performed. Most of the time is typically spent on comparison of non-similar structures. For sets with 
more than, say, 10,000 models this procedure is very time-consuming and alternative faster algorithms, restricting 
comparisons only to most similar structures would be useful. 

Results: We exploit the inverse triangle inequality on the RMSD between two structures given the RMSDs with a 
third structure. The lower bound on RMSD may be used, when restricting the search of similarity to a reasonably 
low RMSD threshold value, to speed up similarity searches significantly. Tests are performed on large sets of decoys 
which are widely used as test cases for predictive methods, with a speed-up of up to 100 times with respect to all- 
vs-all comparison depending on the set and parameters used. Sample applications are shown. 

Conclusions: The algorithm presented here allows fast comparison of large data sets of structures with limited 
memory requirements. As an example of application we present clustering of more than 100000 fragments of 
length 5 from the top500H dataset into few hundred representative fragments. A more realistic scenario is 
provided by the search of similarity within the very large decoy sets used for the tests. Other applications regard 
filtering nearly-indentical conformation in selected CASP9 datasets and clustering molecular dynamics snapshots. 

Availability: A linux executable and a Perl script with examples are given in the supplementary material 
(Additional file 1). The source code is available upon request from the authors. 



Background 

Computational predictions and simulations of biological 
systems entail a wide variety of processes and length and 
time scales. Going down from ecological systems, to 
organisms, organs and cells and subcellular components, 
the lowest level description of biological systems is in 
terms of single molecules and atoms [1]. At this level, the 
structure and dynamics of biomolecules and biocomplexes 
are of utmost importance in determining their function, 
whose knowledge and elucidation is the ultimate goal of 
structural biology. Experimental methods for structural 
characterization of biomolecules are often too slow or 
have limitations in targets and resolution that cannot be 
overcome. For these reasons one often resorts to computa- 
tional predictions or simulations. 
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A common feature of computational methods is that 
they generate a large number, typically in the range of 
thousands, of molecular models which are meant as 
samples of the large conformational space of a molecule 
or of a complex. 

As a consequence, clustering of different conforma- 
tions of the same molecular structure is a frequently per- 
formed task which allows on the one hand to reduce the 
number of conformations to be subjected to further ana- 
lysis, and on the other hand to choose the most represen- 
tative conformations among many [2]. Clustering is 
mostly performed based on pairwise distance, see e.g. the 
GROMACS package manual for some widely used meth- 
ods (URL: http://www.gromacs.org). In many cases a dis- 
similarity criterion is used instead of a similarity criterion 
(see e.g. for a general discussion [3,4]). These issues are 
well illustrated in the fields of protein structure predic- 
tions and molecular dynamics simulations. 



O© 2012 Fogolari et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons 
BiolVlGCl C^ntrBl Attribution License (http://creativecommons.0rg/licenses/by/2.O), which permits unrestricted use, distribution, and reproduction in 
any medium, provided the original work is properly cited. 



Fogolari ef al. Algorithms for Molecular Biology 2012, 7:16 
http://www.almob.Org/content/7/1 /1 6 



Page 2 of 10 



For instance, prediction of protein structure is typically 
accomplished in two steps: first a large number of plausi- 
ble models (possibly including near-native models) is gen- 
erated and afterwards models are scored and ranked [5]. 
The latter step often considers each model as representa- 
tive of an ensemble of similar conformations and some 
kind of weighted average is performed to pick up the best 
model (see e. g. [6-9]). For what concerns proteins the 
field of structural predictions has been largely explored in 
the last decades and the world-wide experiment CASP 
(Critical Assessment of Structural Predictions) has set 
standards for evaluation and has provided extensive eva- 
luation of methods and approaches [10]. Since 2006, 
among other prediction categ ories, the model Quality 
Assessment (QA) category has been introduced where 
predictors are asked to evaluate the quality of the many 
models (in the range of hundreds) proposed by servers for 
each target sequence. In recent rounds of CASP consensus 
methods have consistently scored better than single struc- 
ture methods. Consensus may be taken on the score given 
by different model quality assessment programs or on the 
similarity of the models within the ensemble itself. Actu- 
ally in the last CASP8 and CASP9 experiments the top 
performing model quality estimators were those using 
clustering/consensus methods on the ensemble of depos- 
ited models. Using the similarity among different predic- 
tive models was recognized earlier as a significant 
improvement factor for assessing the quality of a model 
[11]. Since then methods have known continuous evolu- 
tion up to the recent remarkable achievements in blind 
assessment experiment by the programs QmeanClust 
[12,13], 3D-Jury [14], Peons [15], ModFoldClust [16] and 
others [9]. In these methods the similarity among predic- 
tive models, as well as scores obtained sometimes by dif- 
ferent, possibly independent, sources, is used in different 
ways tailored on the specific model quality assessment 
method used. In some, but not all the methods, only simi- 
larities above a given threshold are used [14]. The reader 
is referred to the original literature for details. 

Somewhat surprisingly, however, a naive consensus 
method ranking the models based on the average similar- 
ity with other models in the ensemble was found to per- 
form like the best consensus model quality estimators [9] 
as earlier suggested by Elofsson and coworkers [11]. Struc- 
tural comparisons appear therefore very important for 
proper choice or scoring of predictive models. 

Similarly to predictive tasks, in molecular dynamics 
simulations typically many conformations are generated, 
in the range of thousands, by taking snapshots of the tra- 
jectory at given time intervals. Here clustering of different 
conformations of the same molecular structure is per- 
formed to reduce the number of conformations to be sub- 
jected to further analysis (e.g. docking simulations), or to 



choose the most representative conformations among 
many [2]. 

The measures of similarity that are typically used for 
proteins have been set in the context of the CASP 
experiment. The most straightforward measure of (dis) 
similarity is the root mean square deviation (RMSD) of 
corresponding atoms after optimal superposition of two 
molecular structures. This measure requires only the 
definition of the set of atoms to be superimposed and 
the set of atoms on which to compute the RMSD. The 
RMSD is considered sometimes less sensitive than other 
measures because well modeled parts of the protein will 
be not represented by the RMSD dominated by large 
deviations and, viceversa, large deviations in small 
regions will not contribute significantly to the RMSD 
computed from averaging over a typically much larger 
set of atoms. 

Several other measures of similarity have been pro- 
posed to give a representation of similarity which 
reflects the fraction of well modeled residues. In this 
respect MaxSub [17], GDT_TS [18] and TM-score [19] 
are considered more appropriate than the RMSD. The 
details and motivations of these similarity scores are 
reported in the original papers and they will not be 
repeated here. Suffice it to say that these methods assign 
a score based on the maximum number of residues that 
can be superimposed at one or more given distance 
threshold. It is worth to remind that these scores are 
assigned through iterative algorithms that require by far 
more computations than a single RMSD computation. 
For this reason we will consider here the widely used 
definition of distance as the RMSD between correspond- 
ing C a (CA) atomic positions after optimal superposi- 
tion. However, as long as a dissimilarity measure is a 
metric the methodology reported hereafter applies. 

Once a distance definition has been chosen, in the con- 
sensus procedures or representative model selections dis- 
cussed above, the all-vs-all pairwise similarity 
computations are straightforward to implement, but the 
time required to analyse a set of, say, 10000 structures 
may become exceedingly large, requiring approximately 
50 x 10 comparisons. 

There are two issues to consider with all-vs-all com- 
parisons. The first is the quadratic number of compari- 
sons and the second is that RMSD computation is time 
consuming itself. 

The latter aspect has been investigated and, compared 
to earlier methods of RMSD calculation [20,21], signifi- 
cantly faster solutions which avoid finding the optimal 
rotation matrix, have been proposed based on quater- 
nion formalism [22]. 

An approximated solution to the all-vs-all comparison 
problem has been proposed by Li and Zhou [23], who 
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avoid finding the N( - N 2 Y) minimum RMSDs altogether, 
but rather choose randomly a reference structure, super- 
pose only once all structures on the reference one (using 
N - 1 rototranslations) and compute the RMSD without 
any further rototranslation. The authors showed that the 
choice of reference structure is almost ininfluential, at 
least on their datasets, and that the RMSD computed in 
this way furnishes a good approximation to the mini- 
mum RMSD. 

Here we improve the efficiency of all-vs-all compari- 
sons by considering that often only good similarities are 
searched for. If a threshold RMSD value is set reason- 
ably low, compared to the average random RMSD in the 
set, many comparisons can be avoided by using the 
inverse triangular inequality satisfied by the RMSD. We 
exploit this fact and we show that a significant speed-up 
is obtained on large decoy sets. An application to clus- 
tering of a very large number of fragments to a repre- 
sentative set is presented as an example. A more 
realistic scenario where different models for the same 
protein (resulting from structural predictions or molecu- 
lar dynamics simulations) is provided by the search of 
similarity within very large decoy sets. Other applica- 
tions like filtering nearly identical structures in a set or 
clustering molecular dynamics snapshots are shown as 
examples. 

1 Methods 

Given a set of N decoys a straighforward similarity 
search would require N x ^ -1) structural comparisons. 
After the first structure is compared with all the others, 
however, the corresponding set of pairwise RMSDs is 
available. We use the set of pairwise comparisons with 
the common reference structure to exclude from com- 
parison those pairs of structures whose RMSD is surely 
above the threshold. 

Let us define RMSD°j" and RMSD tj as the RMSD 
between structures i and /, after optimal superposition 
and with no optimal superposition, respectively. It has 
been shown by Edwards et al. [24] and by Steipe and 
Kaindl [25,26] that RMSD° pt is a metric on the space of 
the classes of equivalent structures (i. e. structures that 
can be superimposed exactly by a rototranslation). As a 
consequence both the triangle inequality 
[RMSDf <= RMSD 0 ? 1 + RMSD 0 *-) and the inverse tri- 
angle inequality {RMSDf >= \RMSDf - RMSD^\ ) 

hold. Note that the metric properties of RMSD opt are 
not trivial as for RMSD because a different rototransla- 
tion is in principle implied by each RMSD° P . As a con- 
sequence of RMSD opt being a metric, the following 
inverse triangle inequality holds: 



RMSDJ > \RMSD'[; - RMSDy \ (1) 

(In the appendix we provide a brief demonstration of the 
inverse triangle inequality for RMSD opt ). Based on the 
inverse triangle inequality, if we have computed RMSD 0 ^ 
with i = 2, .., N it is possible to exclude from further com- 
parison all pairs i and j for which \RMSD°^ — RMSD°y\ is 
larger than the chosen threshold t. By using this principle 
similarity searches may be sped up significantly. 

Let us have an ensemble of N structures to be 
searched for structural similarity and let us assume that 
the distribution of pairwise RMSDs within the ensemble 
is fix) (f{x) has been found by us in many decoys sets to 
be close to a normal distribution). 

The algorithm we will describe iterates over steps 
requiring first a one-vs-all comparison (N - 1 compari- 
sons when we have N structures) and then using the list 
of RMSDs to avoid unnecessary comparisons. In the lat- 
ter task the number of comparisons depends, obviously, 
on the RMSDs distribution, but also on the use we 
make of the already computed RMSDs. 

If the list of RMSDs is sorted, the structures corre- 
sponding to lowest RMSD will be the more distant from 
other structures and the inverse triangular inequality 
will make most comparisons unnecessary. On the other 
hand as we consider structures whose RMSD is in the 
most populated region of the distribution there will be 
many other structures with similar RMSD and the 
inverse triangular inequality will make many fewer com- 
parison unnecessary. 

It will be therefore more efficient to use the inverse 
triangular inequality only on the structures displaying an 
RMSD below a certain value s. 

The number of comparisons N cmp involved at the step 
when N structures have not been compared yet, will be 
the sum of the {N - 1) comparisons needed to generate 
the list of {N - 1) RMSDs and the number of compari- 
sons that make use, through the inverse triangular 
inequality, of the RMSD list computed: 

N cmp = (N - 1) + (N - 1) (N - 2) j'f (x) (j f(x!) cbA dx (2) 

where s is the RMSD where we stop using the inverse 
triangular inequality on the already computed RMSDs 
and we move to the next step calculating RMSDs for a 
single structure and using the newly computed RMSDs. 

After the comparisons have been made the number of 
structures whose RMSD below threshold t with all other 
structures have been found (N done ) will be: 

N done = 1 + (N - 1) [ f(x)dx (3) 

Jo 
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At each step s should maximize the ratio ~tj~~ based on 

the observed distribution of RMSDs. In practice the value s, 
i.e. the maximum RMSD whose corresponding structure is 
compared to all others using the inverse triangular inequal- 

Ndone 

ity, is chosen when the ratio — - — , which is computed 

while the comparisons are done, reaches a maximum. 

Hereafter the implementation described above is 
reported in C-like pseudocode: 

/* Initialization */ 

for(i = 0; i < n_structures ; i+ + ) 
done [ i ] = 0 ; 

/* Iterations */ 

for(il = 0; il < n_structures ; il++) 
{ 

n_cmp = 0 ; 
k = 0; 

for(i2 = il + 1; i2 < n structures; i2 + +) 



{ 



if ( Idone [i2] ) 
{ 

index [k] = i2 

rmsd[k] =RMSD(struct il, struct ±2) 
n_cmp = n_cmp + 1 
if (rmsd <= threshold ) 
output (il, i2, rmsd) 
k = k + 1 



} 



} 



done [ i 1 ] =1 
n_lef t = k 

index_rmsd<- (index, rmsd) 
sort index_rmsd by rmsd 
ratio = 0 

for(j = 0; ( ( j+1) /n_cmp) > ratio && 
j<n_left; j++) 
{ 

ratio = ( (j+1) /n_cmp) 
for (k= j+1; k<n_left; k++) 

if ( ( index_rmsd . rmsd [k] 
index_rmsd . rmsd [ j ] ) <= threshold) 
{ 

rmsd = RMSD ( index_rmsd . index [k] , 
index_rmsd. index [ j ] ) 

n_cmp = n_cmp + 1 

if (rmsd <= threshold) 

output ( index _rmsd . index [k] , 
index_rmsd . index [ j ] , rmsd) 
} 

done [index_rmsd [ j ] .index] = 1 



Better schemes keeping track of all the already com- 
puted RMSDs would require larger memory. The algo- 
rithm has been tested on decoy sets for small proteins 
downloaded from the Decoys'r'us database [27], represent- 
ing a realistic step in a clustering scenario. 

2 Results and discussion 

2.1 Tests on decoy sets 

The effect of the choice of the threshold has been tested 
on the relatively small 4state_reduced decoy datasets [28] 
which include on average 665 decoys for each target 
protein. 

RMSD computation has been performed on all C a car- 
bons. Four reasonable threshold values have been tested 
(2.4, 2.8, 3.2, 3.6 A) with the results reported in Table 1. 
The effect of the RMSD threshold on the number of 
computations required is apparent at the lower end of 
the range explored where 45% of the computations are 
avoided. As expected the smaller the threshold the lower 
number of computations are required. This suggests that 
RMSD computations could be applied iteratively by clus- 
tering structures at increasing RMSD threshold value. 
This idea is applied hereafter. 

The method has been tested on the semfold decoy 
datasets which include on average 12900 decoys and in 
one case more than 20000 decoys [29]. 

The reduction in the number of computations required 

, „ „ , N x (N - 1) . 
compared to the all-vs-all scheme, 1. e. , is 

significant and makes on average 87% of the comparisons 
unnecessary, proving its usefulness (Table 2). Besides the 
speed-up in computation time, the scheme presented 
above has the further advantage of requiring only mem- 
ory proportional to N for storing the list of structure files 
and RMSDs of one structure with all the remaining ones, 
as evident in the pseudo-code detailed above. 

Comparing the results reported in Tables 1 and 2 
there appears to be an effect of the ensemble size N on 
the ratio of the actual computations over N(N - l)/2. 



Table 1 Number of RMSD computations for the 
4state reduced decoy dataset with varying RMSD 
threshold. 



Decoy set 


t = 2.4 A 


t = 2.8 A 


t = 3.2 A 


t = 3.6 A 


N(N-l) 


Ictf 


87,219 


103,780 


1 22,095 


140,094 


1 98,765 


1r69 


130,510 


150,415 


171,407 


1 89,659 


228,150 


1sn3 


106,523 


124,587 


147,101 


174,527 


217,470 


2cro 


183,135 


206,133 


189,277 


221,219 


227,475 


3icb 


1 30,653 


119,774 


1 39,425 


176,184 


213,531 


4pti 


106,581 


131,345 


1 75,432 


191,091 


236,328 


4rxn 


95,306 


117,515 


143,481 


1 90,997 


228,826 


Total 


839,927 


953,549 


1,088,220 


1,283,751 


1,550,545 
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Table 2 Number of RMSD computations for the semfold 
decoy dataset. 

Decoy set This work 



N(N-l) 



ratio 



Ictf 

1e68 

1eh2 

1khm 

Inkl 

1pgb 



11,753,426 

9,039,397 

7,332,361 

22,047,014 

7,966,008 

1 3,465,834 



64,997,1 01 

64,541,841 

65,453,961 

222,193,740 

67,995,291 

63,636,121 



0.18 
0.14 
0.11 
0.10 
0.12 
0.21 



Total 



71,604,689 



548,818,055 



0.13 



For this table the RMSD threshold was 3.0 A. 

This effect is not apparent from equations 2 and 3 
because the integration upper limit s depends implicitly 
on N. Assuming heuristically that the RMSDs are dis- 
tributed according to a simple sine law, equations 2 and 
3 can be integrated and the ratio is found to decrease 
with the size of the ensemble. In order to test this sug- 
gestion, for the largest set (semfold lkhm set, 22000 
structures) we considered sets with 1/2, 1/4, 1/8, 1/16, 
1/32, 1/64 of the structures and plotted the ratio of 
computations over N(N - l)/2 against log{N). The 
dependence is almost linear in the range considered 
(Figure 1). 

2.2 Clustering fragments from the top500H dataset 

As expected the above tests show that for lower RMSDs 
application of the inverse triangular inequality is very 
efficient in reducing the number of comparisons to be 
performed. The idea is exploited here to cluster short 
protein fragments. 
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Figure 1 Ratio of the number of RMSD computations 
performed over N{N - 1)/2 versus the logarithm of the number 
of structures in subsets from the semfold 1 khm decoy set 



Fifirst RMSDs below a very low threshold are com- 
puted. The computations will be many less than 
N x (N - 1) , 

- , because the inverse triangular inequality 

is used in the best way at low thresholds. 

Based on the computed RMSDs fragments are clus- 
tered and each fragment is weighted based on the proxi- 
mity of similar fragments. Among the many schemes 
available we have assigned a weight according to the fol- 
lowing equation: 



it'. 



COS 7T 



RMSD 



2t 



where is the weight before clustering (i.e. 1 in the 
first clustering and greater or equal to 1 in subsequent 
clusterings) and t is the threshold RMSD chosen. This 
scheme is similar to others suggested for weighting simi- 
lar structures [30] or for choosing the most representa- 
tive fragment among similar ones [7,8]. The list of 
fragments is then sorted by the weight and for each 
fragment, starting with the one with larger weight, all 
other fragments with RMSD less than threshold from 
the reference one are clustered together and represented 
by the reference one. 

The RMSD threshold is increased and the procedure 
repeated until a single cluster is found. 

As an example we have taken the dataset obtained by 
considering all continuous five residues fragments from 
the proteins in the top500H dataset [31] which includes 
500 curated non redundant protein structures obtained by 
X-ray crystallography with resolution better than 1.8 A 
and with few deviations from ideal geometry. 

We have chosen a length of five following Micheletti et 
al. [30] who showed that a small dataset of five-residues 
fragments is able to represent accurately all five-residues 
fragments. Many of these fragments are nearly identical 
because of secondary structure elements and therefore the 
task of clustering residues should involve either a filtering 
or a large number of comparisons. The number of frag- 
ments is 107184 which implies, for an all-vs-all compari- 
son ca. 5.7 billion RMSDs computations. 

Our algorithm was able to cluster at various thresh- 
olds the fragments in a couple of hours on a laptop, 
with minimal memory requirements. 

The results are listed in Table 3 where the number of 
starting fragments and the number of representative frag- 
ments is reported together with the number of RMSD 
computations performed and the number of all-vs-all 
computations. 

In table 4 similar figures are reported with the differ- 
ence that superposition is performed on atoms N,CA,C, 
O. The structures grouped in the 16 clusters that 
include more than 1% of the whole fragment dataset 
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Table 3 Fragment clustering. 



RMSD 

threshold (A) 



N. 

struct. 



This work 



N(N-l) 
2 



ratio 



Rep. 
Struct. 



0.05 

0.1 

0.2 

0.4 

0.8 



107,184 57,978,627 

79,994 28,610,140 

47,502 1 9,089,084 

13,066 4,479,481 

1,853 393,824 



5,744,151,336 0.0100 79,994 

3,199,480,021 0.0089 47,502 

1,128,196,251 0.0169 13,066 

85,353,645 0.0525 1,853 

1,715,878 0.2295 131 



Total 



107,184 110,551,156 5,744,151,336 0.0193 



Columns report the threshold RMSD chosen for clustering, the number of 
starting fragments, the number of RMSD computations done, the number of 
computations in an all-vs-all comparison, the number of representative 
fragments, used as starting fragments at the next iteration. CA atoms have 
been used for superposition. 



have been superimposed to the representative structure 
and they are displayed in Figure 2. Typical conforma- 
tions of a-helices and /3-strands are found in the first 
and second group, respectively, as expected. Other 
groups are associated with tight turns, with a-helix and 
/J-strand terminal motifs and with different additional fi- 
strand conformations. 

From Tables 3 and 4 it is immediately seen how effec- 
tive is the fast structural similarity search proposed in 
this work compared to all-vs-all comparisons. The num- 
ber of actual computations is just 2% and 6% of that 
implied by all-vs-all comparisons for data in Tables 3 
and 4, respectively. 

Overall these results confirm the reliability of the 
methodology whose implementation is made possible by 
the fast computation of structural similarities. 

2.3 Filtering predictive models for model quality 
assessment 

The tests performed on decoy sets are representative of 
a possible clustering scenario in the context of most 
representative model selection. In the context of the 
CASP esperiment the models available are in the range 



Table 4 Fragment clustering on backbone. 



RMSD 

threshold (A) 



N. 

struct. 



This work 



N(N-1) 
2 



ratio 



Rep. 
Struct. 



0.05 

0.1 

0.2 

0.4 

0.8 

1.6 

3.2 



107,184 53,793,796 5,744,151,336 0.0094 105,299 

105,299 124,506,221 5,543,887,051 0.0225 87,195 

87,195 57,001,567 3,801,440,415 0.0150 63,829 

63,829 65,723,340 2,037,038,706 0.0322 21,637 

21,637 33,375,313 234,069,066 0.1426 2,445 

2,445 2,586,436 2,987,790 0.8657 33 

33 528 528 1.0000 2 



Total 



107,184 336,987,202 5,744,151,336 0.0587 



Columns report the threshold RMSD chosen for clustering, the number of 
starting fragments, the number of RMSD computations done, the number of 
computations in an all-vs-all comparison, the number of representative 
fragments, used as starting fragments at the next iteration. Atoms N, CA, C, O 
have been used for superposition. 



of few hudreds and therefore the advantage of our 
method is limited. We address here another application 
consisting in filtering most similar models before apply- 
ing consensus methods. Consensus methods rely on the 
independence of the predictive models. Although in 
principle it would be desirable to define a distance 
based on most used similarity measures like MaxSub 
[17], GDT_TS [18] or TM-score [19] it is difficult to 
enforce in the latter measures the metric properties (in 
particular the triangular property) that are needed for 
the present method to work. For this reason we used 
the RMSD as pairwise distance. Since the models may 
be of different, even non overlapping, lengths we nor- 
malized the RMSD on the number of the aligned resi- 
dues according to simple scaling proposed by Carugo 
and Pongor [32]: 



RMSD 



RMSD 



100 



1 + ^log(^) 



where RMSD 100 is the estimated RMSD when 100 
residues are aligned and N is the number of aligned 
residues. When N is less than 14 the RMSD is set to a 
maximum value. 

We chose all server predictions for FM (free model- 
ing) targets in the recent CASP9 experiment (i.e. T0529, 
T0531, T0534, T0537, T0544, T0547, T0550, T0553, 
T0555, T0561, T0571, T0578, T0581, T0604, T0608, 
T0616, T0618, T0621, T0624, T0629, T0637 and 
T0639). We assume that FM models that display a nor- 
malized pairwise RMSD below 1.0 A are not indepen- 
dent and therefore only one representative model 
should be considered. In this way all nearly identical 
models are identified. In the sets considered there are 
many models (deposited by the same predictors) that 
are identical. 

The effectiveness of the procedure may be judged by 
considering the number of comparisons performed 
(115399) versus those to be performed in an all-vs-all 
comparison (1040520) which amounts to a ratio of 0.11. 
The procedure allows to remove on average 40 predic- 
tive models that have a nearly identical model in each 
set. 

2.4 Clustering molecular dynamics snapshots 

The analysis of molecular dynamics trajectories often 
involves the comparison and clustering of thousands of 
snapshots. Clustering is mostly based on the analysis of 
pairwise RMSDs. When only closest similar conforma- 
tions are to be grouped together the method proposed 
here strongly reduces the number of comparisons to be 
performed. 

In the following two illustrative applications of the 
method are given. 
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Figure 2 Fragments superposed on the representative fragment for the 16 cluster populated with more than 1% of the whole 
data set. 



We consider the decoy set vhp mcmd [33], a decoy set 
containing alternative conformations for the villin head- 
piece obtained from snapshots of long molecular 
dynamics runs in explicit solvent starting from five differ- 
ent conformations, including the native one. The decoy 
set includes 6255 conformations. In order to choose the 
most representative ones we consider similarities at 1.0 A 
threshold. The search at this similarity level is performed 
with 1717823 comparison, i.e. 0.088 times the compari- 
sons required by a straightforward all-vs-all comparison 
(6255 x 3127). Clustering of structures based on the 



obtained similarities allows to identify for each of the five 
simulations a representative structure. Among these the 
largest cluster (245 out of 651 structures) is found for the 
simulation starting from the native structure, which high- 
lights the lesser conformational dispersion for simula- 
tions starting from the native structure with respect to 
those starting from non-native structures. 

We consider as another application the analysis of the 
simulation of cis-trans isomerization of Pro 32 in /?2- 
microglobulin that drives a switch in the hydrogen 
bonding network of residues Arg 3, Thr 4, His 31, Pro 
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32, His 84 and Thr 86 [34]. We consider the pairwise 
RMSD after superposition of all the N, CA, C and O 
atoms of the above six residues. We set the threshold at 
0.4 A to spot significantly different conformations. The 
number of comparison performed is 0.25 of those 
required by an all-vs-all comparison. The first two most 
representative clustered conformations are relative to 
the native cis Pro 32 conformation and non-native trans 
Pro 32 conformation, thus allowing selection of most 
representative conformations at the selected region. 

3 Conclusions 

The proposed method for fast structural similarity 
search below a given distance threshold makes use of 
the inverse triangular disequality in order to avoid unne- 
cessary comparisons. 

The applications reported here shows that the method 
is able to save many comparisons whenever the distance 
threshold is set much smaller than the average distance 
between structures in the set. 

The advantage of the method compared to all-vs-all 
comparisons is more and more evident with larger 
datasets. 

The method seems therefore mostly suited to calculate 
fastly similarities among large ensembles of conformers, 
e.g. those obtained by predictive softwares like Rosetta 
[35] where thousands of models are typically generated, 
or for large structural database analysis. In this respect 
the first three applications described in the Results sec- 
tion appear particularly suited for the method. Analysis 
of molecular dynamics snapshot is also an important 
field of application. The only drawback of the method is 
that the threshold distance between conformations must 
be much lower than the average distance for the method 
to be efficient. In this respect the comparison of confor- 
mations of subsystems (that display conformational 
mobility) or comparison at a very low threshold distance 
(aiming at identifying nearly identical conformers) 
appear to be the situation of choice for the application 
of the method, as shown in the examples of the Results 
section. 

The main limitation of the method is that it relies on 
the use of the metric properties of the dissimilarity mea- 
sure. For this reason it is not evident how to extend the 
application in order to make use of useful similarities 
measures like MaxSub [17], GDT_TS [18] and TM- 
score [19]. Turning the latter measures into proper 
metrics is a prerequisite for using the method in predic- 
tive contexts like that set up in the CASP experiment. 
On the other hand it is apparent that the method may 
be used also in other contexts whenever a metric can be 
defined and similarities at low distances (compared to 
the average pairwise distance) are sought. 



4 Availability 

A reference program for Linux, using, with minor modi- 
fications, the RMSD calculation routines of Theobald 
[22], is available as supplementary material (Additional 
file 1). The source code is available upon request from 
the authors. A Perl script implementing the algorithm is 
provided with a subroutine distQ that wraps any exter- 
nal measure of distance. 
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Appendix 

Let us consider two structures i and / which have been 
optimally superposed on structure 1, so that RMSD 0 ^ 
and RMSD 0 ^ are known. We look for a lower bound on 
RMSD°y based on RMSDf. and RMSDf.. 

First, we traslate all structures in such a way that the 
center of geometry of atoms to be superposed is the 
same for all structures. Second we write the RMSD 



opt 



[RMSDT) 2 = — — '— — 



1 12 



(4) 



(5) 



N 



where r ik and rj k are the vectors of atom k in structure 
i and /', respectively, and R^j is the rotation that super- 
poses optimally, i.e. with the least RMSD, structure i 
onto structure /. 

If we arrange the 3 x N atomic coordinates r ik in a 
single vector v, and build a 3AT x 3A/" block diagonal 
matrix R by repeating the 3x3 rotation matrix R N 
times along the diagonal, we can simplify the above 
notation: 



VN x RMSD"? = 



(6) 



(7) 



Let us define v 1 as 

O^fc. Note that 

VNRMSD°V = IIR^! - VjW = llCiOO " v?) II 
because the norm is invariant under rotations. 
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Now we can rewrite equation (6) and derive a lower 
bound for RMSDT based on RMSDf and RMSD 0 ^ . In 
the next equation we assume that RbASD 0 * > RMSD 0 ^ 
without loss of generality. 

VN x RMSDf = - R^OH (8) 

= lift -vj + & -^OH (9) 

The inverse triangle inequality applies to euclidean 
distance: 

ii^, - ~ Vl ) + (5; - > lift - 5;n - us; - R^flW (10) 

By rendering explicit the second term in the right- 
hand member of the inequality we get: 

lift - "ill - Ift - T$M\ = lift - Sill - llfi^ifR&jWi - 5,)lll (11) 

= \*/NRMSDvi - VNRMSD°y\ (12) 

Note that in the latter equation the first term is the 
RMSD between rototraslated structure 1 and structure i 
with no optimal superposition. Since it is assumed that 
RMSD"? > RMSD"? the following inequality holds: 

\VNRMSD vi - •jNRMSDf j '\ > Vn\RMSD^/ - RMSD**] 

As a consequence of the above equations the inverse 
triangular inequality holds: 

RMSD"^ > \RMSDf- - RMSD°f?\ (13) 



Additional material 



Additional file 1: Zipped le with the Linux executable and example 
files 



7 Acknowledgements 

We thank Dr. I. M. Lait for valuable support. 

Received: 5 September 2011 Accepted: 29 May 2012 
Published: 29 May 2012 

References 

1 . Fogolari F: Computers and Biosciences. Frascati Physics Series 2006, 
40:275-280. 

2. Leach AR: Molecular modelling: principles and applications. 2 edition. Harlow, 
Essex, UK: Pearson Education Limited; 2001. 

3. Jain AK, Murty NM, Flynn PJ: Data clustering: a review. ACM Comput Surv 
1999, 31:264-323. 

4. Zhao Y, Karypis G: Data clustering in life science. Moi Biotechnol 2005, 
31:55-80. 

5. Bonneau R, Baker D: Ab initio protein structure prediction: progress and 

prospects. Ann Rev Biophys Biomol Struct 2001, 30:173-189. 



6. Shortle D, Simons KT, Baker D: Clustering of low-energy conformations 
near the native structures of small proteins. Proc Natl Acad Sci 1998, 
95:11158-11162. 

7. Xiang Z, Soto CS, Honig B: Evaluating conformational free energies: the 
colony energy and its application to the problem of loop prediction. 

Proc Natl Acad Sci USA 2002, 99:7432-7437. 

8. Fogolari F, Tosatto SC: Application of MM/PBSA colony free energy to 
loop decoy discrimination: toward correlation between energy and root 
mean square deviation. Protein Sci 2005, 14:889-901. 

9. Kryshtafovych A, Fidelis K, Tramontano A: Evaluation of model quality 
assessment predictions in CASP9. Proteins: Struct Funct Bioinf20^^, 
79S:91-106. 

10. Moult J: A decade of CASP: progress, bottlenecks and prognosis in 
protein structure prediction. Curr Op Struct Biol 2005, 15:285-289. 

11. Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A: Peons: a neural-network- 
based consensus predictor that improves fold recognition. Protein Sci 
2001, 10:2354-2362. 

12. Benkert P, Tosatto SC, Schomburg D: QMEAN: A comprehensive scoring 
function for model quality assessment. Proteins 2008, 71:261-277. 

13. Benkert P, Schwede T, Tosatto SC: QMEANclust: estimation of protein 
model quality by combining a composite scoring function with 
structural density information. BMC Struct Biol 2009, 9:35. 

14. Ginalski K, Elofsson A, Fischer D, Rychlewski L: 3D-Jury: a simple approach 
to improve protein structure predictions. Bioinformatics 2003, 
19:1015-1018. 

15. Larsson P, Skwark MJ, Wallner B, Elofsson A: Assessment of global and 
local model quality in CASP8 using Peons and ProQ. Proteins 2009, 
77(Suppl 9):1 67-1 72. 

16. (VlcGuffin U: The ModFOLD server for the quality assessment of protein 
structural models. Bioinformatics 2008, 24:586-587. 

17. Siew N, Elofsson A, Rychlewski L, Fischer D: MaxSub: an automated 
measure for the assessment of protein structure prediction quality. 
Bioinformatics 2000, 16:776-785. 

18. Zemla A: LGA: A method for finding 3D similarities in protein structures. 
Nucleic Acids Res 2003, 31:3370-3374. 

19. Zhang Y, Skolnick J: Scoring function for automated assessment of 
protein structure template quality. Proteins 2004, 57:702-710. 

20. Kabsch W: A solution for the best rotation to relate two sets of vectors. 
Acta Crystallogr A 1 976, 32A:922-923. 

21. Kabsch W: A discussion of the solution for the best rotation to related 
two sets of vectors. Acta Crystallogr A 1 978, 34:827-828. 

22. Theobald DL: Rapid calculation of RMSDs using a quaternion-based 
characteristic polynomial. Acta Crystallogr A 2005, 61:478-480. 

23. Li H, Zhou Y: SCUD: fast structure clustering of decoys using reference 
state to remove overall rotation. J Comput Chem 2005, 26:1 189-1 192. 

24. Edwards J, Chatham G, Glunt W, McDonald D, Wells C, Hayden T: Sampling 
properties of the alternating projection distance geometry algorithm 
applied to unconstrained polypeptide chains. Computers Chem 1997, 
21:115-124. 

25. Steipe B: A revised proof of the metric properties of optimally 
superimposed vector sets. Acta Crystallogr A 2002, 58:506. 

26. Kaindl K, Steipe B: Metric properties of the root-mean-square deviation of 
vector sets. Acta Crystallogr A 1997, 53:809. 

27. Samudrala R, Levitt M: Decoys 'R' us: a database of incorrect protein 
conformations to improve protein structure prediction. Protein Sci 2000, 
9:1399-1401. 

28. Park B, Levitt M: Energy function that discriminate X-ray and Near-native 
folds from well-constructed decoys. J Moi Biol 1996, 258:367-392. 

29. Samudrala R, Levitt M: A comprehensive analysis of 40 blind protein 
structure predictions. BMC Struct Biol 2002, 2:3-18. 

30. (vlicheletti C, Seno F, Maritan A: Recurrent oligomers in proteins: an 
optimal scheme reconciling accurate and concise backbone 
representations in automated folding and design studies. Proteins 2000, 
40:662-674. 

31. Lovell SC, Davis IW, Arendall WB, de Bakker PI, Word JM, Prisant MG, 
Richardson JS, Richardson DC: Structure validation by Calpha geometry: 
phi,psi and Cbeta deviation. Proteins 2003, 50:437-450. 

32. Carugo 0, Pongor S: A normalized root-mean-square distance for 
comparing protein three-dimensional structures. Protein Sci 2001, 
10:1470-1473. 



Fogolari ef al. Algorithms for Molecular Biology 2012, 7:16 
http://www.almob.Org/content/7/1 /1 6 



Page 10 of 10 



33. Fogolari F, Tosatto SC, Colombo G: A decoy set for the thermostable 
subdomain from chicken villin headpiece, comparison of different free 
energy estimators. BMC Bioinformatics 2005, 6:301. 

34. Fogolari F, Corazza A, Varini N, Rotter M, Gumral D, Codutti L, Renneila E, 
Viglino P, Bellotti V, Esposito G: Molecular dynamics simulation of B2- 
microglobulin in denaturing and stabilizing conditions. Proteins 2011, 
79:986-1001. 

35. Leaver-Fay A, Tyka M, Lewis SM, Lange OF, Thompson J, Jacak R, 
Kaufman K, Renfrew PD, Smith CA, Sheffler W, Davis IW, Cooper S, 
Treuille A, Mandell DJ, Richter F, Ban YE, Fleishman SJ, Corn JE, Kim DE, 
Lyskov S, Berrondo M, Mentzer S, Popovic Z, Havranek JJ, Karanicolas J, 
Das R, (Vleiler J, Kortemme T, Gray JJ, Kuhlman B, Baker D, Bradley P: 
ROSETTA3: an object-oriented software suite for the simulation and 
design of macromolecules. Meth Enzymol 201 1, 487:545-574. 



doi:10.1186/l748-7188-7-16 

Cite this article as: Fogolari ef al:. Fast structure similarity searches 
among protein models: efficient clustering of protein fragments. 

Algorithms for Molecular Biology 2012 7:16. 



Submit your next manuscript to BioMed Central 
and take full advantage of: 

• Convenient online submission 

• Thorough peer review 

• No space constraints or color figure charges 

• Immediate publication on acceptance 

• Inclusion in PubMed, CAS, Scopus and Google Scholar 

• Research which is freely available for redistribution 



Submit your manuscript at f \ RioM _-| rpntr ,i 

www.biomedcentral.com/submit \ J BHMWea t-entrai 



